Abstract

We study robust and efficient distributed algorithms for searching, storing, and maintaining 
data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that 
experience heavy node churn (i.e., nodes join and leave the network continuously over time). 
Our goal is to guarantee, despite high node churn rate, that a large number of nodes in the 
network can store, retrieve, and maintain a large number of data items. Our main contributions 
are fast randomized distributed algorithms that guarantee the above with high probability even 
under high adversarial churn. In particular, we present the following main results: 

1. A randomized distributed search algorithm that with high probability guarantees that
searches from as many as $n - o(n)$ nodes ($n$ is the stable network size) succeed in $O(\log n)$
rounds despite $O(n/\log^{1+\delta} n)$ churn per round, for any small constant $\delta > 0$. We assume
that the churn is controlled by an oblivious adversary (that has complete knowledge and
control of what nodes join and leave and at what time and has unlimited computational
power, but is oblivious to the random choices made by the algorithm).

2. A storage and maintenance algorithm that guarantees, with high probability, that data items
can be efficiently stored (with only $O(\log n)$ copies of each data item) and maintained
in a dynamic P2P network with churn rate up to $O(n/\log^{1+\delta} n)$ per round. Our search
algorithm together with our storage and maintenance algorithm guarantees that as many
as $n - o(n)$ nodes can efficiently store, maintain, and search even under $O(n/\log^{1+\delta} n)$
churn per round. Our algorithms require only polylogarithmic in $n$ bits to be processed
and sent (per round) by each node.

To the best of our knowledge, our algorithms are the first-known, fully-distributed storage 
and search algorithms that provably work under highly dynamic settings (i.e., high churn rates 
per step). Furthermore, they are localized (i.e., do not require any global topological knowledge) 
and scalable. A technical contribution of this paper, which may be of independent interest, is 
showing how random walks can be provably used to derive scalable distributed algorithms in
dynamic networks with adversarial node churn. 

Keywords: Peer-to-Peer network, Dynamic network, Search, Storage, Distributed algorithm, Ran- 
domized algorithm, Random Walks, Expander graph. 
1 Introduction



Peer-to-peer (P2P) computing has emerged as one of the key networking technologies in recent years,
with many application systems, e.g., Skype, BitTorrent, Cloudmark, CrashPlan, and Symform. For
example, systems such as CrashPlan [16] and Symform [57] are relatively recent P2P-based storage 
services that allow data to be stored and retrieved among peers [35]. Such data sharing among 
peers avoids costly centralized storage and retrieval, besides being inherently scalable to millions 
of peers. However, many of these systems are not fully P2P; they also use dedicated centralized 
servers in order to guarantee high availability of data — this is necessary due to the highly dynamic 
and unpredictable nature of P2P. Indeed, a key reason for the lack of fully-distributed P2P systems 
is the difficulty in designing highly robust algorithms for large-scale dynamic P2P networks. 

P2P networks are highly dynamic networks characterized by high degree of node churn — i.e., 
nodes continuously join and leave the network. Connections (edges) may be added or deleted 
at any time and thus the topology changes very dynamically. In fact, measurement studies of 
real-world P2P networks [21, 25, 53, 56] show that the churn rate is quite high: nearly 50% of
peers in real-world networks can be replaced within an hour. (However, despite a large churn 
rate, these studies also show that the total number of peers in the network is relatively stable.) 
P2P algorithms have been proposed for a wide variety of tasks such as data storage and retrieval 
[IHJ UHl [181 HU [26] , collaborative filtering |TT] , spam detection [15] , data mining [T7] , worm detection 
and suppression [38, 59], privacy protection of archived data [23], and recently, for cloud computing
services as well [57, 8]. However, all algorithms proposed for these problems have no theoretical
guarantees of being able to work in a dynamically changing network with a very high churn rate, 
which can be as much as linear (in the network size) per round. This is a major bottleneck in 
implementation and wide-spread use of P2P systems. 

In this paper, we take a step towards designing provably robust and scalable algorithms for 
large-scale dynamic P2P networks. In particular, we focus on the fundamental problem of stor- 
ing, maintaining, and searching data in P2P networks. Search in P2P networks is a well-studied 
fundamental application with a large body of work in the last decade or so, both in theory and 
practice (e.g., see the survey [36]). While many P2P systems/protocols have been proposed for
efficient search and storage of data (cf. Section 1.3), a major drawback of almost all of these is the
lack of algorithms that work with provable guarantees under a large amount of churn per round.
The problem is especially challenging since the goal is to guarantee that almost all nodes* are able
to efficiently store, maintain, and retrieve data, even under high churn rate. In such a highly dy- 
namic setting, it is non-trivial to even just store data in a persistent manner; the churn can simply 
remove a large fraction of nodes in just one time step. On the other hand, it is costly to replicate 
too many copies of a data item to guarantee persistence. Thus the challenge is to use as little 
storage as possible and maintain the data for a long time, while at the same time designing efficient 
search algorithms that find the data quickly, despite high churn rate. Another complication to this 
challenge is designing algorithms that are also scalable, i.e., in which nodes process and send only a
small number of (small-sized) messages per round. 



*In sparse, bounded-degree networks, as assumed in this paper, an adversary can always isolate some number of
nodes due to churn, hence "almost all" is the best one can hope for in such networks. 






1.1 Our Main Results 



We provide a rigorous theoretical framework for the design and analysis of storage, maintenance, 
and retrieval algorithms for highly dynamic distributed systems with churn. We briefly describe 
the key ingredients of our model here. (Our model is described in detail in Section 2.) Essentially,
we model a P2P network as a bounded-degree expander graph whose topology — both nodes and 
edges — can change arbitrarily from round to round and is controlled by an adversary. However, 
we assume that the total number of nodes in the network is stable. The number of node changes per 
round is called the churn rate or churn limit. We consider a churn rate of up to $O(n/\log^{1+\delta} n)$,
where $\delta > 0$ is any small constant and $n$ is the stable network size. Note that our model is quite
general in the sense that we only assume that the topology is an expander at every step; no 
other special properties are assumed. Indeed, expanders have been used extensively to model 
dynamic P2P networks^ in which the expander property is preserved under insertions and deletions
of nodes (e.g., [2, 35, 44]). Since we do not make assumptions on how the topology is preserved,
our model is applicable to all such expander-based networks. (We note that various prior work on 
dynamic network models (e.g., [3j [331 El EH]) make similar assumptions on preservation of topological 
properties (such as connectivity and high expansion) at every step under insertions/deletions;
cf. Section 1.3. The issue of how such properties are preserved is abstracted away from the
model, which allows one to focus on the dynamism. Indeed, this abstraction has been a feature of
most dynamic models; e.g., see the survey of [12].)

Our main contributions are efficient randomized distributed algorithms for searching, storing, 
and maintaining data in dynamic P2P networks. Our algorithms succeed with high probability 
(i.e., with probability $1 - 1/n^{\Omega(1)}$, where $n$ is the stable network size) even under high adversarial
churn in a polylogarithmic number of rounds. In particular, we present the following results (the
precise theorem statements are given in Section 4):

1. (cf. Theorem 3) A storage and maintenance algorithm that guarantees, with high probability,
that data items can be efficiently stored (with only $O(\log n)$ copies of each data item)
and maintained in a dynamic P2P network with churn rate up to $O(n/\log^{1+\delta} n)$ per round,
assuming that the churn is controlled by an oblivious adversary (that has complete knowledge
and control of what nodes join and leave and at what time and has unlimited computational
power, but is oblivious to the random choices made by the algorithm).

2. (cf. Theorem 4) A randomized distributed search algorithm that with high probability
guarantees that searches from as many as $n - o(n)$ nodes succeed in $O(\log n)$ rounds under up to
$O(n/\log^{1+\delta} n)$ churn per round. Our search algorithm together with the storage and
maintenance algorithm guarantees that as many as $n - o(n)$ nodes can efficiently store, maintain,
and search even under $O(n/\log^{1+\delta} n)$ churn per round. Our algorithms require only
polylogarithmic in $n$ bits to be processed and sent (per round) by each node.

To the best of our knowledge, our algorithms are the first-known, fully-distributed storage 
and search algorithms that work under highly dynamic settings (i.e., high churn rates per step). 
Furthermore, they are localized (i.e., do not require any global topological knowledge) and scalable. 



^Throughout this paper, we use log to represent natural logarithm unless explicitly specified otherwise. 

'A number of works on static networks have used expander graph topologies to solve the agreement and related
problems [29, 20, 58]. Here we show that similar expansion properties are beneficial in the more challenging setting
of dynamic networks (cf. Section 1.3).

''Using erasure coding techniques, the number of bits stored can be reduced even further, so as to incur only a
constant factor overhead. We discuss this in Section 4.4.
1.2 Technical Contributions

We derive techniques (cf. Section [3]) for doing scalable distributed computation in highly dynamic 
networks. In such networks, we would like distributed algorithms to work correctly and efficiently, 
and terminate even in networks that keep changing continuously over time (not assuming any 
eventual stabilization). The main technical tool that we use is random walks. Flooding techniques 
(which proved useful in solving the agreement problem under high adversarial churn [2]) are not 
useful for search as they generate a lot of messages and hence are not scalable. We note that random
walks have been used before to perform search in P2P networks (e.g., [40, 37, 61, 24]) as well as for
other applications such as sampling (e.g., [HEl]), but these are not applicable to dynamic networks 
with large churn rates. 

One of the main technical contributions of this paper is showing how random walks can be 
used in a dynamic network with high adversarial node churn (cf. Section 3). The basic idea is
quite simple and is as follows. All nodes generate tokens (which contain the source node's id)
and send them via random walks continuously over time. These random walks, once they "mix"
(i.e., reach close to the stationary distribution), reach essentially "random" destinations in the 
network; we (figuratively) call these simultaneous random walks as a soup of random walks. Thus 
the destination nodes receive a steady stream of tokens from essentially random nodes, thereby 
allowing them to sample nodes uniformly from the network. While this is easy to establish in a 
static network, it is no longer true in a dynamic network with adversarial churn — the churn can 
cause many random walks to be lost and also might introduce bias. We show a technical result 
called the "Soup Theorem" (cf. Theorem 1) that shows that "most" random walks do mix (despite
large adversarial churn) and have the usual desirable properties as in a static network. We use 
the Soup theorem crucially in our search, storage, and maintenance algorithms. We note that our 
technique can handle churn only up to $n/\mathrm{polylog}\, n$. Informally, this is due to the fact that at least
$\Omega(\log n)$ rounds are needed for the random walks to mix, before which no non-trivial computation
can be performed. This seems to be a fundamental limitation of our random walk based method.
We come close to this limit in that we allow churn to be as high as $O(n/\log^{1+\delta} n)$ for any fixed
$\delta > 0$.
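
The soup idea can be illustrated with a small simulation. The following is a minimal sketch, not the paper's algorithm: it assumes a fresh random regular multigraph in every round (built from random permutations, a common way to obtain expanders with high probability) and a churn set chosen uniformly at random, which is just one particular oblivious strategy. The names `random_regular_multigraph` and `soup` are hypothetical.

```python
import random

def random_regular_multigraph(n, half_degree, rng):
    """Union of `half_degree` random permutations and their inverses.

    Every node ends up with exactly 2 * half_degree neighbor entries
    (self-loops and parallel edges are allowed in this sketch).
    """
    nbrs = [[] for _ in range(n)]
    for _ in range(half_degree):
        perm = list(range(n))
        rng.shuffle(perm)
        for u, v in enumerate(perm):
            nbrs[u].append(v)
            nbrs[v].append(u)
    return nbrs

def soup(n, rounds, churn_per_round, half_degree=4, seed=0):
    """Start one token per node and walk all tokens for `rounds` rounds.

    Returns a list whose i-th entry is the final position of token i,
    or None if the token was lost because its host was churned out.
    """
    rng = random.Random(seed)
    position = list(range(n))
    for _ in range(rounds):
        # Assumption: the topology is redrawn and the churned nodes are
        # picked at random, independently of where the tokens are.
        nbrs = random_regular_multigraph(n, half_degree, rng)
        churned = set(rng.sample(range(n), churn_per_round))
        for i, pos in enumerate(position):
            if pos is None:
                continue
            if pos in churned:
                position[i] = None          # token dies with its host
            else:
                position[i] = rng.choice(nbrs[pos])
    return position
```

With `n = 1000`, 30 rounds, and 10 replacements per round, a token survives a round with probability about 0.99, so roughly $0.99^{30} \approx 74\%$ of the tokens should survive, and the survivors should be spread out so that no node holds more than a handful of them.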

Another technique that we use as a building block in our algorithms is construction and maintenance
of (small-sized) committees. A committee is a clique of small ($O(\log n)$) size composed of
essentially "random" nodes. We show how such a committee can be efficiently constructed, and 
more importantly, maintained under large churn. A committee can be used to "delegate" a storage 
or a search operation; its small size guarantees scalability, while its persistence guarantees that 
the operation will complete successfully despite churn. Our techniques (the Soup Theorem and 
committees) can be useful in other distributed applications as well, e.g., leader election. 

1.3 Related Work 

There has been significant prior work in designing P2P networks that are provably robust to a large 
number of Byzantine faults (e.g., see [22j [27J 021 EH [7] ) . These focus on robustly enabling storage 
and retrieval of data items under adversarial nodes. However, these algorithms will not work in 
a highly dynamic setting with large, continuous, adversarial churn (controlled by an all-powerful 
adversary that has full control of the network topology, including full knowledge and control of what 
nodes join and leave and at what time and has unlimited computational power). Most prior works 
develop algorithms that will work under the assumption that the network will eventually stabilize 
and stop changing. (An important aspect of our algorithms is that they will work and terminate
correctly even when the network keeps continually changing.) There has been a lot of work on 
P2P algorithms for maintaining desirable properties (such as connectivity, low diameter, bounded 
degree) under churn (see e.g., [3U [28J [34], but these don't work under large adversarial churn 
rates. In particular, there has been very little work to date that rigorously addresses distributed
computation in dynamic P2P networks under high node churn. The work of [31] raises the open
question of whether one can design robust P2P protocols that can work in highly dynamic networks 
with a large adversarial churn. The recent work of [2] was one of the first to address the above 
question; its focus was on solving the fundamental agreement problem in P2P networks with very 
large adversarial churn. However, the paper does not address the problem of search and storage, 
which was a problem left open in [2]. 

There has been significant work in the design of P2P systems for doing efficient search. These 
can be classified into two categories — (1) Distributed Hash Table (DHT)-based schemes (also 
called "structured" schemes) and (2) unstructured schemes; we refer to [36] for a detailed survey. 
However, most of these systems have no provable performance guarantees under large adversarial
churn. DHT schemes create a fully decentralized index that maps data items to peers and allows 
a peer to search for a data item efficiently without flooding. In unstructured networks, there is 
no relation between the data identifier and the peer where it resides. There has also been a lot
of work on search in unstructured network topologies; see, e.g., [37, 40] and the references therein.
Our algorithms assume an unstructured network. 

There have been works on building fault-tolerant Distributed Hash Tables (which are classified
as "structured" P2P networks unlike ours which are "unstructured" e.g., see [H]) under different 
deletion models — adversarial deletions and stochastic deletions. The structured P2P network 
described by Saia et al. [49] guarantees that a large number of data items are available even if a large
fraction of arbitrary peers are deleted, under the assumption that, at any time, the number of peers 
deleted by an adversary must be smaller than the number of peers joining. Kuhn et al. consider 
in [34] that up to $O(\log n)$ nodes (adversarially chosen) can crash or join per constant number of
time steps. Under this amount of churn, it is shown in [34] how to maintain a low peer degree
and bounded network diameter in P2P systems by using the hypercube and pancake topologies. 
Scheideler and Schmid show in [52] how to maintain a distributed heap that allows join and leave
operations and, in addition, is resistant to Sybil attacks. A robust distributed implementation
of a distributed hash table (DHT) in a P2P network is given by [7], which can withstand two
important kinds of attacks: adaptive join-leave attacks and adaptive insert/lookup attacks by up to
$\epsilon n$ adversarial peers. That paper assumes that the good nodes always stay in the system and the
adversarial nodes are churned out and in, but the algorithm determines where to insert the new 
nodes. Naor and Wieder [33] describe a simple DHT scheme that is robust under the following
simple random deletion model — each node can fail independently with probability p. They show 
that their scheme can guarantee logarithmic degree, search time, and message complexity if p is 
sufficiently small. Hildrum and Kubiatowicz [27] describe how to modify two popular DHTs, Pastry 
[35] and Tapestry [60] to tolerate random deletions. Several DHT schemes (e.g., [551 H7[ [30] ) have 
been shown to be robust under the simple random deletion model mentioned above. There also 
have been works on designing fault-tolerant storage systems in a dynamic setting using quorums 
(e.g., see [HJ [42]). However, these do not apply to our model of continuous churn. 

To address the unpredictable and often unknown nature of network dynamics, [33] study a model 
in which the communication graph can change completely from one round to another, with the only 
constraint being that the network is connected at each round. The model of [33] allows for a much
stronger adversary than the ones considered in past work on general dynamic networks [H El E] . 
The surveys [32, 13] summarize recent work on dynamic networks.

The dynamic network model of [331 El [50] allows only edge changes from round to round while 
the nodes remain fixed. In our work, we study a dynamic network model where both nodes and 
edges can change by a large amount. Therefore, the framework we study in Section 2 (and first
introduced in [2]) is more general than the model of [33], as it is additionally applicable to dynamic
settings with node churn. We note that the works of [U [50] study random walks under a dynamic 
model where the nodes are fixed (and only edges change) and hence not applicable to systems with 
churn.

Expander graphs and spectral properties have already been applied extensively to improve 
the network design and fault-tolerance in distributed computing in general ([58, 20, 9]) and P2P
networks in particular [31, 2]. The problem of achieving almost-everywhere agreement among nodes
in P2P networks — modeled as an expander graph — is considered by King et al. in [31] in the 
context of the leader election problem. However, the algorithm of [31] does not work for dynamic 
networks. The work of [2] addresses the agreement problem in a dynamic P2P network under an 
adversarial churn model where the churn rates can be very large, up to linear in the number of 
nodes in the network. It also crucially makes use of expander graphs. 

2 Model and Problem Statement 
2.1 Dynamic Network Model 

We consider a synchronous dynamic network with churn represented by a dynamically changing 
graph whose edges represent connectivity in the network. Our model is similar to the one introduced 
in [2]. The computation is structured into synchronous rounds, i.e., we assume that nodes run at 
the same processing speed and any message that is sent by some node u to its (current) neighbors 
in some round $r \geq 1$ will be received by the end of $r$. To ensure scalability, we restrict the number
of bits sent per round by each node to be polylogarithmic in $n$, the stable network size. In each
round, up to $O(n/\log^{1+\delta} n)$ nodes can be replaced by new nodes, for any small constant $\delta > 0$.
Furthermore, we allow the edges to change arbitrarily in each round, but the underlying graph must
be a $D$-regular non-bipartite expander graph ($D$ can be a constant). (The regularity assumption
can be relaxed, e.g., it is enough for nodes to have approximately equal degrees, and our results 
can be extended.) The churn and edge changes are made by an adversary that is oblivious to 
the state of the nodes. (In particular, it does not know the random choices made by the nodes.) 
More precisely, the dynamic network is represented by a sequence of graphs $\mathcal{G} = (G^0, G^1, \ldots)$. We
assume that the adversary commits to this sequence of graphs before round 0, but the algorithm
is unaware of the sequence. Each $G^r = (V^r, E^r)$ has $n$ nodes. We require that for all $r \geq 0$,
$|V^r \setminus V^{r+1}| = |V^{r+1} \setminus V^r| \leq O(n/\log^{1+\delta} n)$. Furthermore, each $G^r$ must be a $D$-regular
non-bipartite expander with a fixed upper bound of $\lambda$ on the second largest eigenvalue in absolute
value.

A node $u$ can communicate with any node $v$ if $u$ knows the id of $v$. When a new node joins
the network, it has only knowledge of the ids of its current neighbors in the network and thus can 



"This is a typical assumption in the context of P2P networks, where a node can establish communication with 
another node if it knows the other node's IP address. 
communicate with them. We note that communication can be highly unreliable due to churn, since
when u sends a message to v there is no guarantee that v is still in the network. However, each 
node in the network is guaranteed to have D neighbors in the network at any round with whom 
it can reliably communicate in that round. We note that random walks always use the neighbor 
edges. 

The network is synchronous, so nodes operate under a common clock. The following sequence
of events occurs in each round or time step $r$. First, the adversary makes the necessary changes
to the network, so the algorithm is presented with graph $G^r$. Each node thus becomes aware of its
neighbors in $G^r$. Each node then exchanges messages with its neighbors. The nodes can perform
any required computation at any time. Each node $u$ has a unique identifier and is churned in at
some round $r_i$ and churned out at some $r_o > r_i$. More precisely, for each node $u$, there is a maximal
range $[r_i, r_o - 1]$ such that for every $r \in [r_i, r_o - 1]$, $u \in V^r$ and for every $r \notin [r_i, r_o - 1]$, $u \notin V^r$.
Any information about the network at large is only learned through the messages that u receives. 
It has no a priori knowledge about who its neighbors will be in the future. Neither does u know 
when (or whether) it will be churned out. For all $r$, we assume that $|V^r| = n$, where $n$ is a suitably
large positive integer. This assumption simplifies our analysis. Our algorithms can be adapted to 
work correctly as long as the number of nodes is reasonably stable. Also, we assume that n (or a 
constant factor estimate of n) is common knowledge among the nodes in the network. 

2.2 The Storage and Search Problem 

In simple terms, we want to build a robust distributed solution for the storage and retrieval of data 
items. Nodes can produce data items. Each data item is uniquely identified by an id (such as its 
hash value). When a node produces a data item, the network must be able to place and maintain 
copies of the data item in several nodes of the network. To ensure scalability, we want to upper 
bound the number of copies of each data item, but more importantly, we must also replicate the 
data sufficiently to ensure that, with high probability, the churn does not destroy all copies of a 
data item. When a node u requires a data item (whose id, we assume, is known to the node), it 
must be able to access the data item within a bounded amount of time. To keep things simple, 
we only require that u knows the id of a node (currently in the network) that has the data item u 
needs. We ideally want an arbitrarily large number of data items to be supported. 
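
The tension between few copies and persistence can be seen in a simplified one-round calculation. The sketch below is illustrative and not the paper's analysis: it assumes the copies sit on distinct nodes and that the churned set is a uniformly random subset of the nodes, chosen independently of where the copies are (one possible behavior of an oblivious adversary). The helper name `prob_all_copies_lost` is hypothetical.

```python
import math

def prob_all_copies_lost(n, churn, copies):
    """Probability that a uniformly random churn set of size `churn`
    contains all `copies` distinct nodes holding a data item.

    This is the hypergeometric probability C(churn, copies) / C(n, copies).
    """
    if copies > churn:
        return 0.0  # not enough churned nodes to hit every copy
    return math.comb(churn, copies) / math.comb(n, copies)

if __name__ == "__main__":
    n = 10**6
    churn = int(n / math.log(n) ** 1.5)      # n / log^{1+delta} n with delta = 0.5
    copies = math.ceil(math.log(n))          # on the order of log n copies
    print(prob_all_copies_lost(n, churn, copies))
```

A single round is therefore very unlikely to wipe out a logarithmic number of randomly placed copies, but without maintenance each copy still disappears with probability about `churn / n` per round, which is why the copies have to be refreshed over time.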

3 Random Walks Under Churn 

As a building block for our solution to the storage and search problem, we study some basic 
properties of random walks in dynamic networks with churn. It is well-known that random walks 
on expander graphs exhibit fast mixing time, thus allowing near uniform sampling of nodes in 
the network. This behavior quite easily extends to expander networks in which edges change 
dynamically, but nodes are fixed [50]. It is more challenging to obtain such characteristics in
networks under adversarial node churn. One issue is that random walks may not survive. The
more challenging issue is that adversarial churn may bias the random walks, which in turn will bias 
sampling of nodes. We address both issues in our analysis. In particular, we show that for any 
time $t$, most of the random walks that were generated at time $t$ survive up to time $t + O(\log n)$
and at that time the surviving walks are close to uniformly distributed among the existing nodes.
We also show on the flip side that the origin of a random walk that survived in the network for
$O(\log n)$ rounds is uniformly distributed among nodes that existed at the time of its origin. We
assume the churn rate to be at most $4n/\log^k n$; we show that $k$ can be of the form $1 + \delta$ for any
fixed $\delta > 0$. We begin by defining some terms and establishing some notations.

Let $\mathbf{1}/n$ be the uniform distribution vector that assigns probability $1/n$ to each of the $n$ nodes in
$G^t$. We define $\pi(\mathcal{G}, s, t, t_0)$, $t \geq t_0 \geq 0$, to be the probability distribution vector of the position of a
random walk in round $t$, given that the random walk started at $s \in G^{t_0}$ in round $t_0$ and proceeded
to walk in the dynamic network $\mathcal{G}$. For our purposes, we will restrict our attention to random walks
that start in round 0, so, for convenience, we use $\pi(\mathcal{G}, s, t)$ to refer to $\pi(\mathcal{G}, s, t, 0)$. The component
$\pi_d(\mathcal{G}, s, t)$ refers to the probability that the random walk is at $d \in V^t$ in round $t$. Since the random
walk could have been terminated because of churn, we use $\pi_*(\mathcal{G}, s, t) = 1 - \sum_{d \in V^t} \pi_d(\mathcal{G}, s, t)$ to
denote the probability that the random walk did not survive until round $t$. We are now ready
to present a key ingredient in our algorithms, namely the Soup Theorem, which may also be of
independent interest in dynamic graphs (and not just P2P networks) with churn.
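
The definitions of $\pi$ and $\pi_*$ can be made concrete by propagating probability mass round by round. The sketch below is illustrative only: it assumes that a walk is lost exactly when the node holding it is churned out at the start of a round, and that a surviving walk moves to a uniformly random current neighbor. The function name `walk_distribution` and its input encoding are hypothetical.

```python
def walk_distribution(neighbor_lists, churn_sets, start):
    """Compute (pi, pi_star) for a walk that starts at `start` in round 0.

    neighbor_lists[t][v] lists the neighbors of node slot v in round t,
    and churn_sets[t] is the set of node slots replaced in round t.
    pi[d] is the probability that the walk is alive and located at d
    after all rounds; pi_star is the probability that it was lost.
    """
    n = len(neighbor_lists[0])
    pi = [0.0] * n
    pi[start] = 1.0
    for nbrs, churned in zip(neighbor_lists, churn_sets):
        nxt = [0.0] * n
        for v, mass in enumerate(pi):
            if mass == 0.0 or v in churned:
                continue  # mass sitting on a churned node is lost
            share = mass / len(nbrs[v])
            for w in nbrs[v]:
                nxt[w] += share
        pi = nxt
    return pi, 1.0 - sum(pi)

if __name__ == "__main__":
    # A 5-cycle (2-regular, non-bipartite) kept fixed for 4 rounds;
    # node 1 is replaced in round 1.
    cycle = [[(v - 1) % 5, (v + 1) % 5] for v in range(5)]
    pi, pi_star = walk_distribution([cycle] * 4, [set(), {1}, set(), set()], 0)
    print(pi, pi_star)
```

In this toy run half of the probability mass sits on node 1 when it is replaced, so the walk is lost with probability one half, and the surviving mass is what the vector `pi` records.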

Theorem 1 (Soup Theorem). Suppose that the churn is limited by $4n/\log^k n$, where $k = 1 + \delta$
for any fixed $\delta > 0$. With high probability, there exists a set of nodes CORE $\subseteq V^0 \cap V^{2\tau}$, with
cardinality at least $n - O(n/\log^{(k-1)/2} n)$, such that for any $s \in$ CORE and $d \in$ CORE, a random walk that
starts in $s$ terminates in $d$ (in round $2\tau$) with probability in $[1/17n, 3/2n]$.

Given a dynamic network $\mathcal{G}$, for the purpose of analysing random walks, we construct a
corresponding random walk preserving dynamic network $\hat{\mathcal{G}}$ as follows. We initialize the network by
setting $\hat{G}^0$ to $G^0$. Each $\hat{G}^t \in \hat{\mathcal{G}}$, $t > 0$, is essentially the same graph as $G^t$ except that we copy the
state of each node that is churned out in round $t$ onto a unique node that is churned in at round $t$.
Intuitively, $\hat{\mathcal{G}}$ is a testbed network corresponding to $\mathcal{G}$ in which random walks that are eliminated
in $\mathcal{G}$ are preserved. Notice that there is a one-to-one correspondence between objects (vertices,
edges, and graphs) in $\mathcal{G}$ and the objects in $\hat{\mathcal{G}}$. For notational clarity, we refer to the sequence of
graphs in $\hat{\mathcal{G}}$ as $(\hat{G}^0, \hat{G}^1, \ldots)$, where each $\hat{G}^t = (\hat{V}^t, \hat{E}^t)$. Given a vertex $v \in V^t$ (resp. $e \in E^t$,
$S \subseteq V^t$, etc.), we denote the corresponding vertex in $\hat{V}^t$ by $\hat{v}$ (resp., $\hat{e} \in \hat{E}^t$, $\hat{S} \subseteq \hat{V}^t$, etc.). Let $\hat{\mathcal{G}}$ be
the random walk preserving dynamic network constructed from a $D$-regular non-bipartite dynamic
network $\mathcal{G}$. We get the following characterizations of random walks.

Lemma 1. Suppose that an oblivious adversary fixes the dynamic network $\mathcal{G}$ of $D$-regular
non-bipartite expander graphs in advance and let $\hat{\mathcal{G}}$ be the corresponding preserving dynamic network. If
each correct node $s$ starts $h \log n$ random walks of length $T \in O(\log n)$ and forwards up to $2h \log n$
random walk tokens per round, then there is a fixed $\tau = M \log n \in O(\log n)$, for some constant
$M > 0$, s.t. the following hold:

(A) Every walk in $\hat{\mathcal{G}}$ completes $T$ steps in $\tau$ rounds w.h.p.

(B) Every walk in $\hat{\mathcal{G}}$ that starts at some node $s$ has probability in $[1/2n, 3/2n]$ of being at any node $d$
after taking $T$ steps. Formally, $\forall s \in V^0\ \forall d \in V^\tau : 1/2n \leq \pi_d(\hat{\mathcal{G}}, s, \tau) \leq 3/2n$.

Proof. For any $c > 0$, we know by [50] that after taking $T = T(c) \in O(\log n)$ steps in a dynamic
$D$-regular expander network with a changing edge topology, a random walk token has probability in
$[1/n - 1/n^c,\ 1/n + 1/n^c]$ of being at any particular node in step $T$, which shows (B).



"in particular, the Soup theorem applies to any expander topology model with churn. It can also be extended 
to general connected networks, although the bounds will depend on the dynamic mixing time [50] of the underlying 
dynamic network. 
Recalling that all graphs in $\hat{\mathcal{G}}$ are $D$-regular and that initially each node generates $h \log n$ tokens,
the expected number of tokens received by a node is $h \log n$ in any round $r$. Applying a standard
Chernoff bound, it follows that with probability $1 - n^{-4}$, each node receives at most $2h \log n$
tokens in $r$. Taking a union bound over $2h \log n$ rounds and all nodes, the same is true at every
correct node during rounds $[1, 2h \log n]$ with probability $\geq 1 - n^{-2}$. Since each node can forward up
to $2h \log n$ tokens per round, we set $M = 2h$. This guarantees that (w.h.p.) every token is forwarded
once in every round. Thus, with high probability, all walks complete all $T$ steps in $\tau = M \log n$
rounds and satisfy the required probability bound. □
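
The Chernoff step in this proof can be spelled out as follows. This is a sketch under an added assumption: the number of tokens $X$ arriving at a fixed node in a fixed round is treated as a sum of independent indicator variables with mean $\mu = h \log n$ (the proof itself only invokes "a standard Chernoff bound"). The multiplicative Chernoff bound with relative deviation 1 and natural logarithms then gives:

```latex
\Pr[X \geq 2\mu] \;\leq\; e^{-\mu/3} \;=\; e^{-(h/3)\log n} \;=\; n^{-h/3},
\qquad\text{so } h \geq 12 \text{ yields } \Pr[X \geq 2h\log n] \leq n^{-4}.

% Union bound over n nodes and 2h log n rounds:
\Pr[\text{some node receives more than } 2h\log n \text{ tokens in some round}]
\;\leq\; n \cdot 2h\log n \cdot n^{-4} \;=\; 2h\log n \cdot n^{-3} \;\leq\; n^{-2}
\quad\text{for sufficiently large } n.
```

The constant 12 is only what this particular form of the bound requires; other forms of the Chernoff bound would give a different threshold for $h$.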

We call $\tau$ the dynamic mixing time of a random walk. It follows therefore that when every
$G \in \mathcal{G}$ is an expander with a second largest eigenvalue bounded by $\lambda$, which is a fixed constant
bounded away from 1, as we have assumed, then the mixing time is $\tau = M \log n$, where $M$ is
a fixed constant known to all nodes in the network. In particular, for every $s \in V^0$ and $d \in V^\tau$,
$1/2n \leq \pi_d(\hat{\mathcal{G}}, s, \tau) \leq 3/2n$.

We first show that given a dynamic graph process $\mathcal{G}$, there is a large set of nodes at time 0,
such that a random walk generated in each of these nodes at time 0 survives up to the mixing time
with probability at least $1 - 1/\log^{(k-1)/2} n$.

Lemma 2. Consider a churn of $4n/\log^k n$ and let $S = \{s : (s \in V^0) \wedge (\pi_*(\mathcal{G}, s, \tau) \leq 1/\log^{(k-1)/2} n)\}$.
Then, $|S| \geq n - 4n/\log^{(k-1)/2} n$.

Proof. Start one random walk from each node in $V^0$. Each $\hat{G} \in \hat{\mathcal{G}}$ being $D$-regular, the expected
number of random walks on any node in $\hat{\mathcal{G}}$ at any round is 1. Therefore, in $\mathcal{G}$, after $\tau \in O(\log n)$
rounds, the expected number of random walks that will be eliminated is

$$\sum_{s \in V^0} \pi_*(\mathcal{G}, s, \tau) \leq \frac{4n}{\log^{k-1} n}, \qquad (1)$$

for some suitable constant c. Since $\pi_*(\mathcal{G}, s, \tau) \geq 1/\log^{(k-1)/2} n$ for every $s \in V^0 \setminus S$, we get

$$\frac{1}{\log^{(k-1)/2} n}\, |V^0 \setminus S| \leq \frac{4n}{\log^{k-1} n}.$$

This implies that $|V^0 \setminus S| \leq 4n/\log^{(k-1)/2} n$. Since $|V^0| = n$, the lemma follows. □

Consider a random walk that started at time 0 from any $s \in S$. We now endeavor to show that if the random walk survives for the dynamic mixing time $\tau$ then its destination will be close to uniformly distributed.

Lemma 3. Suppose that we have a churn limit of $4n/\log^k n$, where $k = 1 + \delta$ for any fixed $\delta > 0$. With high probability, there exists a set $S \subseteq V^0$ (as defined in Lemma 2) of cardinality at least $n - 4n/\log^{(k-1)/2} n$ with the property that for every $s \in S$, there exists a $D(s) \subseteq V^\tau$ of cardinality at least $n - 4n/\log^{(k-1)/2} n$ such that for every $d \in D(s)$, $1/4n \le \pi_d(\mathcal{G}, s, \tau) \le 3/2n$.

Proof. A random walk that survives in $\mathcal{G}$ also survives in $\bar{\mathcal{G}}$, thus for every $s \in V^0$ and $d \in V^\tau$,

$$\pi_d(\mathcal{G}, s, \tau) \le \pi_d(\bar{\mathcal{G}}, s, \tau) \le 3/2n.$$

The crux of the proof is showing that a random walk from $s \in S$ reaches a large number of nodes in $V^\tau$, each with probability at least $1/4n$. From Lemma 2, for every $s \in S$, $\pi_*(\mathcal{G}, s, \tau) \le 1/\log^{(k-1)/2} n$. Consider a walk that started at time 0 at $s \in S$. Summing over all the possible locations of the walk at time $\tau$ we have

$$\sum_{d \in V^\tau} \left(\pi_d(\bar{\mathcal{G}}, s, \tau) - \pi_d(\mathcal{G}, s, \tau)\right) = \pi_*(\mathcal{G}, s, \tau) \le 1/\log^{(k-1)/2} n. \tag{2}$$

To lower bound $|D(s)|$, we upper bound the cardinality of the complement set

$$\bar{D} = V^\tau \setminus D(s) = \{d : (d \in V^\tau) \wedge (\pi_d(\mathcal{G}, s, \tau) < 1/4n)\}.$$

Since $\bar{D} \subseteq V^\tau$, we can restrict the summation in Equation (2) to the elements of $\bar{D}$ and get

$$\sum_{d \in \bar{D}} \left(\pi_d(\bar{\mathcal{G}}, s, \tau) - \pi_d(\mathcal{G}, s, \tau)\right) \le 1/\log^{(k-1)/2} n. \tag{3}$$

By Lemma 1(A), with high probability all walks have completed $\tau$ steps, and, by Lemma 1(B), $\pi_d(\bar{\mathcal{G}}, s, \tau) \ge 1/2n$ for every $d \in V^\tau$, but $\pi_d(\mathcal{G}, s, \tau) \le 1/4n$ for every $d \in \bar{D}$. Thus, for any $s \in S$ and $d \in \bar{D}$,

$$\pi_d(\bar{\mathcal{G}}, s, \tau) - \pi_d(\mathcal{G}, s, \tau) \ge \frac{1}{4n}.$$

Plugging into Equation (3) we have $|\bar{D}| \le (1/\log^{(k-1)/2} n)/(1/4n) = 4n/\log^{(k-1)/2} n$, thus establishing the lemma. □

In Lemma 3 we studied the distribution of the destination of random walks originating from a large set $S \subseteq V^0$. Similarly, in Lemma 4 we formalize our understanding of the origin of random walks that terminate in some large set $D \subseteq V^\tau$.

Lemma 4 (Reversibility of Random Walks). Suppose that the churn is limited by $4n/\log^k n$. With high probability, there exists a set $D \subseteq V^\tau$ of cardinality at least $n - 4n/\log^{(k-1)/2} n$ such that, given a node $d \in D$, there is a set $S(d) \subseteq V^0$ of cardinality at least $n - 4n/\log^{(k-1)/2} n$ such that a random walk that terminated in $d$ originated in any $s \in S(d)$ with probability in the range $[1/4n, 3/2n]$.

Proof. Notice first that the reverse sequence of graphs $\overleftarrow{\mathcal{G}} = (G^\tau, G^{\tau-1}, \ldots, G^0)$ is a valid sequence of graphs that make up a dynamic network, albeit one that has a limited number of rounds to offer. Furthermore, let $u$ and $v$ be two neighbours in $\mathcal{G}$ at some time $t$. Since $G^t$ is $D$-regular, the probability of a random walk on $u$ moving to $v$ equals the probability of a random walk moving from $v$ to $u$. Therefore, to study the distribution of the origin of a random walk (in $V^0$) that terminated in some node $t$ in $G^\tau$, we can initiate a random walk in the same node $t$ in $\overleftarrow{\mathcal{G}}$ (in round 0) and study the distribution of the random walk's destination in $G^0$ (in round $\tau$). Lemma 3 applies to $\overleftarrow{\mathcal{G}}$, implying that (w.h.p.) there exist sets $D \subseteq V^\tau$ and $S \subseteq V^0$, both of cardinality at least $n - 4n/\log^{(k-1)/2} n$, such that a random walk originating in some $d \in D$ in round 0 of $\overleftarrow{\mathcal{G}}$ terminated in some $s \in S$ in round $\tau$. The lemma follows when we view this random walk property from the perspective of $\mathcal{G}$. □

Combining Lemmata 3 and 4 carefully we can complete the proof of Theorem 1.

Proof of Theorem 1. The upper bound of $3/2n$ on the probability that $s$ terminates in $d$ follows quite easily from Lemma 3 when we note that the probability with which a random walk terminates at any node does not increase over time. Therefore, we focus on the lower bound.

First, we choose $D \subseteq V^{2\tau}$ based on Lemma 4 such that (w.h.p.) for any $d \in D$, there is a set $D(d) \subseteq V^\tau$ of cardinality at least $n - 4n/\log^{(k-1)/2} n$ such that, for every $d' \in D(d)$, a random walk that terminated in $d$ originated from $d'$ with probability at least $1/4n$.

We then choose $S \subseteq V^0$ based on Lemma 3 such that (w.h.p.) for any $s \in S$, there is a set $S(s) \subseteq V^\tau$ of cardinality at least $n - 4n/\log^{(k-1)/2} n$ such that, for every $s' \in S(s)$, a random walk that originated in $s$ terminates in $s'$ with probability at least $1/4n$.

Notice that the cardinality of $S(s) \cap D(d)$ for any $s \in S$ and $d \in D$ is at least $n - 8n/\log^{(k-1)/2} n$.

We now fix a pair $(s, d)$, where $s \in S$ and $d \in D$, and consider a random walk that terminated in $d$. The random walk was in some node in $S(s) \cap D(d)$ in round $\tau$ with probability at least

$$\sum_{x \in S(s) \cap D(d)} \frac{1}{4n} = (1/4) - o(1) \ge (1/4) - \varepsilon$$

for any fixed $\varepsilon > 0$. Let us now condition our random walk on the event that the random walk was on some $x \in S(s) \cap D(d)$ in round $\tau$. Then, it originated from $s$ with probability at least $1/4n$. Therefore, the random walk that terminated in $d$ originated from $s$ with probability at least $(1/4 - \varepsilon)(1/4n) \ge 1/17n$ when $\varepsilon \le 1/68$. The theorem follows by setting Core $= S \cap D$ because |Core| is also at least $n - 8n/\log^{(k-1)/2} n$. □
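The constant $1/17$ in the last step is easy to verify with exact rational arithmetic. The following is only a small check of that arithmetic, not part of the proof; the function name `origin_lower_bound` is ours and purely illustrative.

```python
from fractions import Fraction


def origin_lower_bound(eps: Fraction) -> Fraction:
    """Return c such that the origin probability is at least c / n."""
    # (1/4 - eps) bounds the chance of being in the intersection of S(s)
    # and D(d) at round tau; the second 1/4 is the coefficient of the
    # 1/(4n) bound on having originated from s.
    return (Fraction(1, 4) - eps) * Fraction(1, 4)


# At the threshold eps = 1/68 the bound is exactly 1/17.
print(origin_lower_bound(Fraction(1, 68)))
# Any smaller eps keeps the bound at or above 1/17.
print(origin_lower_bound(Fraction(1, 100)) >= Fraction(1, 17))
```

Any larger $\varepsilon$ (for example $1/50$) would push the coefficient below $1/17$, which is why the threshold is stated as $\varepsilon \le 1/68$.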



4 Storage and Search of Data 

In this section we describe a mechanism that enables all but $o(n)$ nodes to persistently store data in the network. We will assume that the churn rate is $4n/\log^k n$. A key goal is to tolerate as much churn as possible, hence we would like $k$ to be as small as possible. With this in mind, we again show in the analysis that $k$ can be of the form $1 + \delta$ for any fixed $\delta > 0$.

A naive solution is to flood the data through the network and store it at a linear number of nodes, which guarantees fast retrieval and persistence with probability 1. Clearly, such an approach does not scale to large peer-to-peer networks due to the congestion caused by flooding and the costs of storing the item at almost every node. As we strive to design algorithms that are useful in large scale P2P-networks, we limit the amount of communication by using random walks instead of flooding and require only a sublinear number of nodes to be responsible for the storage of an item: only $\Theta(\log n)$ of these nodes will actually store the item**, whereas the other nodes serve as landmarks pointing to these $\Theta(\log n)$ nodes.

Suppose that node $u$ wants to store item $\mathcal{I}$ and assume that $u$ is part of the large set of nodes Core provided by Theorem 1, which consists of nodes that are able to obtain (almost uniform) node id samples from the same set, despite churn. A well known solution is to make use of the birthday paradox: If node $u$ is able to select $\Omega(\sqrt{n\log n})$ sample ids and assign these so called data nodes to store $\mathcal{I}$, then $\mathcal{I}$ can be retrieved within $\sqrt{n}$ rounds by most nodes, with high probability. In our dynamic setting, up to $O(n/\log^k n)$ nodes per round can be affected by churn, which means



**In fact (as we noted earlier) using erasure coding techniques, the overall storage can be limited to a constant factor overhead; see Section 4.4.

that the number of data nodes might decrease rapidly. Care must be taken when replenishing the number of data nodes, as we need to ensure that the data nodes are chosen randomly and their total number does not exceed $O(\sqrt{n})$. A simple algorithm for estimating the actual number of data nodes is to require data nodes first to generate a random value from the exponential distribution with rate 1, then to aggregate the minimum generated value $z$ by flooding it through the network (cf. [2]), and finally to compute the estimate as $1/z$. The simplicity of the above approach comes at the price of requiring every node to participate (by flooding) in the storage of the item.
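The minimum-of-exponentials estimator just described can be sketched in a few lines. This is a centralized simulation under our own assumptions: flooding is replaced by a plain `min`, and the averaging over several repetitions is an addition of ours to reduce variance (the text only describes the single-shot estimate $1/z$). The name `estimate_count` is hypothetical.

```python
import random


def estimate_count(num_data_nodes: int, repetitions: int,
                   rng: random.Random) -> float:
    """Estimate num_data_nodes from minima of Exp(1) draws."""
    minima = []
    for _ in range(repetitions):
        # Each data node draws Exp(1); flooding would deliver the minimum z.
        z = min(rng.expovariate(1.0) for _ in range(num_data_nodes))
        minima.append(z)
    if repetitions == 1:
        # Single-shot estimate 1/z, as in the text.
        return 1.0 / minima[0]
    # Assumption (not in the text): combine k independent minima.
    # The minimum of m Exp(1) values is Exp(m), so the sum of k minima is
    # Gamma(k, m) and (k - 1) / sum is an unbiased estimate of m.
    return (repetitions - 1) / sum(minima)


rng = random.Random(7)
print(estimate_count(500, 1, rng))    # single shot: high variance
print(estimate_count(500, 400, rng))  # averaged: close to 500
```

A single draw of $1/z$ is correct only up to a large constant factor with constant probability, which is one more reason the flooding cost per estimate matters.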

We now describe an approach that avoids the above pitfalls and provides fast data retrieval and persistence with high probability. It limits the actual number of nodes needed for storing a data item to $O(\log n)$, while a large set of $\Omega(\sqrt{n})$ nodes serve as so-called "landmarks". That is, a node $v$ is a landmark for item $\mathcal{I}$ in round $r$, if $v$ knows the id of some node $w \in V^r$ that stores $\mathcal{I}$. Note that even if $v$ was a landmark in $r$, it might no longer be a landmark in round $r + 1$ if $w$ has been churned out at the beginning of $r$; moreover, $v$ itself will not be aware of this change until it attempts to contact $w$. To facilitate the maintenance of a large set of randomly distributed landmarks, our algorithms construct a committee of $\Theta(\log n)$ nodes via the overlay network. In the context of the storage procedure, the committee is responsible for storing some data item $\mathcal{I}$ and creating sufficiently many (i.e. $\Omega(\sqrt{n})$) randomly distributed storage landmarks for allowing fast retrieval of $\mathcal{I}$ by other nodes. If, on the other hand, $u$ wants to retrieve item $\mathcal{I}$, having a large number of search landmarks will significantly increase the probability of finding a sample of a storage landmark in short time. Due to churn, the number of landmark nodes (and the number of committee members) might be decreasing rapidly. Thus the committee members continuously need to replenish the committee and rebuild the landmark set. Note that we guarantee that the number of landmarks involved with a storage or search request remains in $O(n^{1/2+\delta})$, for any constant $\delta > 0$, which ensures that our algorithms are scalable to large networks.
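To see why $\Omega(\sqrt{n})$ landmarks on each side suffice, the following back-of-the-envelope sketch computes the chance that a set of search probes hits at least one storage landmark. It assumes exactly uniform, independent samples, which is an idealization of ours; the paper only guarantees per-node probabilities in $[1/17n, 3/2n]$, so the constants below are optimistic. The function name `miss_probability` is illustrative.

```python
import math


def miss_probability(n: int, landmarks: int, probes: int) -> float:
    """Chance that `probes` uniform samples all miss a fixed landmark set."""
    return (1.0 - landmarks / n) ** probes


n = 1_000_000
root = math.isqrt(n)  # sqrt(n) = 1000
for c in (1, 2, 4):
    # c * sqrt(n) storage landmarks probed by c * sqrt(n) samples
    print(c, 1.0 - miss_probability(n, c * root, c * root))
```

The miss probability behaves like $e^{-c^2}$, so a modest constant (or an extra logarithmic factor) already drives it down quickly.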

4.1 Building Block: Electing and Maintaining a Committee 

We will now study how a node $u$ can elect and maintain a committee of nodes in the network. Such a committee can be entrusted with some task that might need to be performed persistently in the network even after $u$ is churned out. We for instance use such a committee in Section 4.3 to enable $u$ to store a data item $\mathcal{I}$ so that some other node that needs the data may be able to access it well into the future without relying on $u$'s presence in the network. While electing a committee is easy, we need to be careful to maintain the committee for a longer (polynomial in $n$) period of time because, without maintenance, the members can be churned out in $O(\log^k n)$ rounds.

In Section 3 we analyzed the setting where each node in the dynamic network initiates $\Theta(\log n)$ random walks in round 1. We focused on a time span of $\Theta(\log n)$ rounds to study the mixing characteristics of random walks culminating in Theorem 1. Informally speaking, we showed that a large number of nodes in round $\tau = \tau(\mathcal{G}, 1/2n)$ received good samples of nodes currently in the network. This will help us create a committee of randomly chosen nodes, but churn can decimate this committee in $O(\log^k n)$ rounds. Our goal now, however, is to maintain a committee for a longer period of time, namely time that is polynomial in $n$. Towards this goal, we make each node initiate $\alpha\log n$ random walks every round. Depending on how long we want the committee to last, we can fix an appropriately large $\alpha$. Each random walk travels for $2\tau$ rounds; the node at which the random walk stops is called its destination. The destination node can use the source of the random walk as a sample from the set of nodes in the network. Since every node initiates some $\alpha\log n$ random walks every round, Theorem 1 is applicable in every round $r \ge 2\tau$. To formalize this application of Theorem 1 we parameterize Core with respect to time. We define $\mathrm{Core}^r$ to be the largest subset of $V^{r-2\tau} \cap V^r$ such that for any $s \in \mathrm{Core}^r$ and $d \in \mathrm{Core}^r$, a random walk that starts from $s$ (in round $r - 2\tau$) terminates in $d$ (in round $r$) with probability in $[1/17n, 3/2n]$. From Theorem 1 we know that $\mathrm{Core}^r$ has cardinality at least $n - O(n/\log^{(k-1)/2} n)$. When the value of $r$ is clear from the context, we may avoid the explicit superscript.
Algorithm 1 presents an algorithm that

1. enables a node $u \in \mathrm{Core}^r$, $r \ge 2\tau$, to elect a committee of $O(\log n)$ nodes and

2. enables the committee to maintain itself at a cardinality of $\Theta(\log n)$ nodes despite $O(n/\log^k n)$ churn. Moreover, the committee must comprise at least $\Theta(\log n)$ nodes from the current Core.

In Algorithm 1 we assume that $u$ is in the Core when it needs to create the committee. We show in Lemma 6 that, if $u \in$ Core, then $u$ will receive a sufficient number of random samples, so it chooses some $h\log n$ samples to form the committee. To ensure that churn does not decimate the committee, every $O(\log n)$ rounds, we re-form the committee, i.e., the current committee members choose a suitable leader (denoted $c_r$ in Algorithm 1) that chooses a new set of committee members. The old committee members hand over their task to the new committee members and "resign" from the committee and the new members join the committee and resume the task they are called to perform.

We begin our analysis of Algorithm 1 with a lemma that limits the number of random walks that any node receives in round $r$ from nodes that are not currently in the Core.

Lemma 5. Consider a churn rate of $4n/\log^k n$. For any $r \ge 2\tau$ and any $u \in V^r$, let $B(u, r)$ be the number of random walks that started in round $r - 2\tau$ from some node in $V^{r-2\tau} \setminus \mathrm{Core}^r$ and stopped in $u$ in round $r$. Then, $E[B(u, r)] \le 6\alpha\log^{(3-k)/2} n$ and $\Pr[B(u, r) \ge 12\alpha\log^{(5-k)/4} n] \le 1/n^{2\alpha}$.



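Proof. Recall that we analyzed random walks in $\mathcal{G}$ using $\bar{\mathcal{G}}$. Consider node $\bar{u} \in \bar{\mathcal{G}}$ that corresponds to $u \in \mathcal{G}$. Let $v$ be some node in $V^{r-2\tau} \setminus \mathrm{Core}^r$ and let $\bar{v}$ be the corresponding node in $\bar{\mathcal{G}}$. From Lemma 1(B), we know that a random walk that started in $\bar{v}$ will reach $\bar{u}$ with probability at most $3/2n$. Therefore,

$$E[B(u, r)] \le E[\text{number of random walks from } V^{r-2\tau} \setminus \mathrm{Core}^r \text{ that reach } u] \le \left(\frac{4n}{\log^{(k-1)/2} n}\right) \frac{3}{2n}\, \alpha\log n = \frac{6\alpha\log n}{\log^{(k-1)/2} n} = 6\alpha\log^{(3-k)/2} n.$$

Now using Chernoff bounds,

$$\Pr\left[B(u, r) \ge 12\alpha\log^{(5-k)/4} n\right] \le \Pr\left[B \ge \left(1 + \log^{(k-1)/4} n\right) \frac{6\alpha}{\log^{(k-3)/2} n}\right] \le \exp\left(-\frac{6\alpha\log^{(k-1)/2} n}{3\log^{(k-3)/2} n}\right) = \exp(-2\alpha\log n) = \frac{1}{n^{2\alpha}}.$$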



"in our algorithm description, we assume for simplicity that c r is not churned out in round r + 2. We can handle 
the case where $c_r$ is churned out, by having the set $S$ of the $O(\log n)$ committee members that have received the largest number of random walks all perform the task of $c_r$ in parallel, i.e., each of them builds a new committee. Once these committee constructions are complete, the (survived) nodes in $S$ agree on a single member $c^*$ of $S$ and its committee $\mathrm{Com}^*$, and all other committees are dissolved.

Algorithm 1 Committee Maintenance and Construction for node $u$.

Committee Creation.

(Let $r_1 \ge 2\tau$ be the round when $u$ must create Com. We assume that $u \in \mathrm{Core}^{r_1}$. Let $h \le \alpha/36$ be a fixed constant.)

At round $r_1$: Node $u$ chooses $h\log n$ sample ids and requests each of these nodes to join the committee Com. Therefore, Com $\leftarrow \{v \mid (v \in V^{r_1}) \wedge v$ received an invitation from $u\}$. Along with the request, $u$ sends all the ids in Com to every node in Com. This enables the nodes in Com to form a clique interconnection.

Committee Maintenance.

(For every round $r$ that is $2\gamma\tau$ rounds after Com is created, for every positive integer $\gamma$.)
At round $r$: The nodes in Com record the random walks they receive along with the source of each random walk.

At round $r + 1$: The nodes in Com exchange the number of random walks they received in round $r$ with each other.

At the end of round $r + 1$: The number of random walks received by each node in Com is common knowledge among the members of Com. The node $c_r$ with the largest number of random walks is chosen to initiate the new committee (breaking ties arbitrarily yet unanimously). The choice of $c_r$ is now common knowledge among the nodes in Com.

At round $r + 2$: The node $c_r$ chooses $h\log n$ random walks that stopped at $c_r$ in round $r$ and invites their source nodes to form the new committee in round $r + 3$. Let $\mathrm{Com}^*$ be the set of invited source nodes. Along with the invitation, the $h\log n$ ids of all members of $\mathrm{Com}^*$ are included. Therefore, the ids of nodes in $\mathrm{Com}^*$ become common knowledge among the nodes in $\mathrm{Com}^*$. The nodes in Com cease to be members of the committee at the end of round $r + 2$. (If the situation calls for it, we may postpone the "resignation" of the current committee members; the overlap in membership can be used for ensuring smooth transition of the task performed by the committee.)

At round $r + 3$: The members in $\mathrm{Com}^*$ formally take over the committee, i.e., Com $\leftarrow \mathrm{Com}^*$. Each member of the new Com uses the ids of all other members to form a clique interconnection.



□ 

While Lemma 5 limits the number of random walks that a node receives from nodes not in the Core, we also need to ensure that a node in the Core gets a sufficient number of random walks from other nodes in the Core. Thus, when we choose $c_r$, we choose a node that received a large number of samples. From Lemma 5 we know that only a small number of those samples can be from nodes not in the Core, so this ensures that the committee that we choose will be largely from the Core.
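One maintenance epoch of Algorithm 1 can be sketched as follows. This is a centralized toy version under our own assumptions: message exchange between committee members is replaced by a shared dictionary, ties are broken by smallest node id (the algorithm only requires the tie-break to be unanimous), and the logarithm base is taken to be 2. The names `choose_leader` and `reform_committee` are hypothetical.

```python
import math
import random


def choose_leader(walk_counts: dict[int, int]) -> int:
    """Pick the member with the most received walks; ties go to the smallest id."""
    return min(walk_counts, key=lambda node: (-walk_counts[node], node))


def reform_committee(received: dict[int, list[int]], h: float, n: int,
                     rng: random.Random) -> tuple[int, list[int]]:
    """Elect c_r, then invite up to h*log(n) distinct sources of its walks."""
    counts = {member: len(sources) for member, sources in received.items()}
    leader = choose_leader(counts)
    candidates = sorted(set(received[leader]))  # distinct source ids
    size = min(len(candidates), math.ceil(h * math.log2(n)))
    return leader, rng.sample(candidates, size)


# member id -> sources of the walks that stopped at that member in round r
received = {3: [10, 11, 12], 5: [20, 21, 22, 23], 8: [30, 31, 32, 33]}
leader, new_committee = reform_committee(received, h=0.25, n=256,
                                         rng=random.Random(0))
print(leader, new_committee)
```

Choosing the member with the most received walks is what lets Lemma 5 and Lemma 6 be combined: a node with many samples cannot have received most of them from outside the Core.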

Lemma 6. Consider any $u \in \mathrm{Core}^r$, $r \ge 2\tau$. With probability at least $1 - 1/n^{\eta}$, where $\eta \ge \alpha/144$, $u$ receives at least $\alpha\log n/36$ random walks.

Proof. Let $X$ be the number of random walks received by $u$ in round $r$. From Theorem 1 we know that $E[X] \ge \alpha\log n/18$. Using a Chernoff bound, we get

$$\Pr\left[X \le \frac{\alpha\log n}{36}\right] = \Pr\left[X \le (1 - 1/2)\,\frac{\alpha\log n}{18}\right] \le \exp\left(-\frac{\alpha\log n}{144}\right) \le 1/n^{\eta}.$$



□ 



Corollary 1. In Algorithm 1, let $c_r$ be a node chosen in some round $r + 1$ to invite a new set of nodes to form the committee. With probability at least $1 - 1/n^{\eta}$ it received more than $h\log n$ random walks in the previous round since $h \le \alpha/36$. Out of the $h\log n$ random walks, with probability at least $1 - 1/n^{2h}$, at least $h\log n - 12h\log^{(5-k)/4} n$ random walks originated in $\mathrm{Core}^r$.

The $h\log n$ nodes invited by $c_r$ in round $r + 2$ were certainly in the network in round $r - 2\tau$. From Corollary 1 we also know that a large number of those nodes are in $\mathrm{Core}^r$. We also need to ensure that most of them survive for another $2\tau$ rounds until round $r + 2\tau$. In particular, we want to ensure that those nodes that survive are largely in $\mathrm{Core}^{r+2\tau}$.



Lemma 7. Suppose $I$ is the set of recipients of $h\log n$ invitations sent by some $c_r$ in round $r + 2$ (cf. Algorithm 1). Then with probability at most $2/n^{2h}$, $|I \setminus \mathrm{Core}^{r+2\tau}| \in \omega(\log^{(5-k)/4} n)$; in other words, $|I \cap \mathrm{Core}^{r+2\tau}| \in h\log n - o(\log n)$ whp.

Proof. From Corollary 1 we know that with probability at least $1 - 1/n^{2h}$, at most $12h\log^{(5-k)/4} n$ random walks did not originate in the Core. Obviously, these $12h\log^{(5-k)/4} n$ random walks did not originate from nodes in $\mathrm{Core}^{r+2\tau}$ either. In this lemma, we are upper bounding the cardinality of $I \setminus \mathrm{Core}^{r+2\tau}$. Therefore, in addition to the $12h\log^{(5-k)/4} n$ samples that we have already accounted as lost, we must bound the number of samples that we lose between rounds $r$ and $r + 2\tau$ due to churn. During this time period, a total of $4Mn/\log^{k-1} n$ nodes are churned out, since $\tau = \tau(\mathcal{G}, 1/2n) = M\log n$. Each node that is churned out is an opportunity for the adversary to churn out a node in $I$. Let $X$ be the random variable that denotes the number of nodes in $I$ that were churned out between rounds $r$ and $r + 2\tau$. Consider a node $i$ that is churned out. If $i \in I$, then the random walk from $i$ reached $c_r$, an event that can happen with probability at most $3/2n$. Therefore, whenever the adversary churns out a node, it succeeds in churning out a node in $I$ with probability at most $3/2n$. Therefore, $E[X] \le (3/2n)(4Mn/\log^{k-1} n) = 6M/\log^{k-1} n$, and

$$\Pr\left[X \ge 12\sqrt{hM}\,\log^{(2-k)/2} n\right] = \Pr\left[X \ge \left(2\sqrt{\tfrac{h}{M}}\,\log^{k/2} n\right)\frac{6M}{\log^{k-1} n}\right]$$
$$\le \Pr\left[X \ge \left(1 + \sqrt{\tfrac{h}{M}}\,\log^{k/2} n\right)\frac{6M}{\log^{k-1} n}\right] \quad \text{(for sufficiently large } n\text{)}$$
$$\le \exp\left(-\frac{6h\log^{k} n}{3\log^{k-1} n}\right) = \exp(-2h\log n) = 1/n^{2h} \quad \text{(using Chernoff bound)}.$$

Taking the union bound over the probability with which we lose $X$ random walks plus the probability with which we lose the at most $12h\log^{(5-k)/4} n$ random walks that did not originate in $\mathrm{Core}^r$, the result follows. □

Recall that we have assumed that the node $u$ that seeks to create the committee in round $r_1$, $r_1 \ge 2\tau$, is in $\mathrm{Core}^{r_1}$. Let $\mathrm{Com}^r$, $r \ge r_1$, denote the set of nodes that consider themselves to be committee members in round $r$. We say that $\mathrm{Com}^r$ is good if $|\mathrm{Com}^r \cap \mathrm{Core}^r| \ge (1 - \varepsilon)h\log n$ for any fixed $\varepsilon > 0$. The following theorem states that the committee that is created by $u$ will be good for a suitably long period of time.

Theorem 2. Fix $\varepsilon$ to be a small positive number in $(0, 1]$. Recall that $u$ creates the committee in round $r_1$. Let $R \ge r_1$ be a random variable denoting the smallest value of $r$ when $\mathrm{Com}^r$ is not good and let $Y$ be a geometrically distributed random variable with parameter $p = (1/n^{\eta} + 2/n^{2h}) \in n^{-\Omega(1)}$. Then, $Y$ is smaller than $R - r_1 + 1$ in the usual stochastic order \5J$ . In other words, for every positive integer $i$, $\Pr[Y \ge i] \le \Pr[R - r_1 + 1 \ge i]$.

Proof. Let $r + 2$ be a round in which a new set of committee members are invited (cf. Algorithm 1) by a node $c_r \in \mathrm{Core}^r$. We now show that the probability with which (i) the committee that is selected by $c_r$ is good and (ii) remains good until $r + 2\tau + 2$ (when the next set of committee members are selected) is high. The requirements of the theorem will then be subsumed.

We now list some bad events; at least one of them must occur for the committee to cease to be good.

1. With probability at most $1/n^{\eta}$, $c_r$ will receive fewer than $h\log n$ samples. (cf. Corollary 1)

2. With probability at most $1/n^{2h}$, more than $12h\log^{(5-k)/4} n$ samples received by $c_r$ will not be in $\mathrm{Core}^r$. (cf. Corollary 1)

3. With probability at most $1/n^{2h}$, more than $12\sqrt{hM}\,\log^{(2-k)/2} n$ nodes in $\mathrm{Com}^{r+2}$ will be churned out between $r + 2$ and $r + 2\tau + 2$. (cf. Lemma 7)

Thus, for $r + 2 \le r' < r + 2\tau + 2$, $\mathrm{Com}^{r'}$ will not be good with probability at most $p$. Thus, the theorem follows. □

Corollary 2. Let $\ell$ be a suitably large number that respects the inequality $p \le n^{-\ell}$. Suppose at some round $r + 2$, a new set of committee members have been selected by $c_r \in \mathrm{Core}^r$. Let $g \ge 0$ be a random variable such that $r + g + 2$ is the first round after $r + 2$ when the committee ceases to be good. Then, $E(g) \ge n^{\ell}$. Furthermore, for any $0 \le i \le \ell$, $\Pr[g \le n^{\ell-i}] \le n^{-i}$.
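The flavour of Theorem 2 and Corollary 2 can be illustrated numerically with a geometric random variable. The sketch below is ours and makes a simplifying assumption: it counts independent re-formation epochs, each failing with probability exactly $p = n^{-\ell}$, rather than rounds. The function names are hypothetical.

```python
def survival_probability(p: float, epochs: int) -> float:
    """Pr[Y >= epochs] for geometric Y with per-epoch failure probability p."""
    return (1.0 - p) ** (epochs - 1)


def expected_epochs(p: float) -> float:
    """Mean of a geometric random variable with parameter p."""
    return 1.0 / p


n, ell = 1024, 3
p = float(n) ** -ell
print(expected_epochs(p))                       # about n**ell epochs
print(survival_probability(p, n ** (ell - 1)))  # at least 1 - 1/n
```

With $i = 1$ this mirrors the statement $\Pr[g \le n^{\ell-1}] \le n^{-1}$: the committee survives $n^{\ell-1}$ epochs except with probability about $1/n$.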

4.2 Building Block: Constructing a Set of Randomly Distributed Landmarks 

Once we have succeeded in constructing a committee of $O(\log n)$ nodes, we can extend the "reach" of this committee by creating a randomly distributed set of nodes that know about the committee members. An easy but inefficient solution is to simply flood the ids of the committee members through the network, which requires a linear number of messages to be sent. In this section, we will describe a more scalable approach (cf. Algorithm 2) that constructs a set of $\Omega(\sqrt{n})$ randomly distributed nodes that know the ids of the committee members and thus serve as "landmarks" for the committee. The basic idea is that every current committee member selects 2 of its received samples and adds them as children. These child nodes in turn then attempt to select 2 child nodes each and so forth. Taking into account churn, and the fact that only $n - o(n)$ nodes are able to select random child nodes, we choose a tree depth that ensures with high probability that the committee members will succeed in constructing a landmark set of size at least $\Omega(\sqrt{n})$, but containing no more than $O(n^{1/2+\delta}\log n)$ nodes.

Due to the high amount of churn and the fact that the committee members change over time, the committee nodes are responsible for rebuilding the set of landmarks every $O(\log n)$ rounds,

Algorithm 2 Constructing a Random Set of Landmarks

Assumption: There is a committee Com of $O(\log n)$ nodes each of which is carrying out some task $T$ that requires all committee nodes to simultaneously start executing this algorithm. Task $T$ can be either a retrieval or a storage request of some data item $\mathcal{I}$.

Every $\tau$ rounds do:

1: Every committee node $v$ tries to add $\sqrt{n}$ randomly chosen nodes to the landmark set of $\mathcal{I}$ by constructing a tree:

2: Node $v$ contacts its $\Theta(\log n)$ received sample nodes and adds 2 nodes $v_1$ and $v_2$ that are not yet part of the tree as its children (if possible).

3: Nodes $v_1$ and $v_2$ in turn each select 2 (unused) nodes among their own samples as their children and so on. The nodes in the tree keep track of a tree depth counter $\mu$ that is initialized to 0 and increased every time a new level is added to the tree. The construction stops at a tree

depth of 

= log 2 re - 2 (log 2 log re + log 2) 

21 °^ (2(l " J-^) (l " i^) I 1 " £)) 

Note that nodes do not need to remember the actual tree structure. Every time a new level of $v$'s tree is created, the parent nodes send all $O(\log n)$ committee ids to their respective 2 newly added children.

4: Every node that has become a landmark for $\mathcal{I}$ remains a landmark for $2\tau$ rounds and then simply discards any information about $\mathcal{I}$.



which will also ensure that the landmarks are randomly distributed among the nodes currently in Core. We define $\mathrm{Core}^{[r_1, r_2]}$ as a shorthand for $\mathrm{Core}^{r_1} \cap \cdots \cap \mathrm{Core}^{r_2}$.
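The tree construction of Algorithm 2 can be simulated in a simplified, churn-free form. In the sketch below the received samples of each node are modeled as fresh uniform draws from $\{0, \ldots, n-1\}$, which is our assumption and not the paper's sampling model, and the depth is passed in as a parameter instead of being computed from the depth formula of Algorithm 2. The name `grow_landmark_tree` is hypothetical.

```python
import random


def grow_landmark_tree(root: int, n: int, depth: int, samples_per_node: int,
                       rng: random.Random, used: set[int]) -> set[int]:
    """Grow one binary landmark tree and return the landmarks it added."""
    added: set[int] = set()
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for parent in frontier:
            # Stand-in for the parent's received samples (assumed uniform).
            samples = [rng.randrange(n) for _ in range(samples_per_node)]
            children = []
            for candidate in samples:
                # Take at most 2 children that are not yet in any tree.
                if candidate not in used and len(children) < 2:
                    used.add(candidate)
                    children.append(candidate)
            added.update(children)
            next_frontier.extend(children)
        frontier = next_frontier
    return added


n = 1 << 16                      # sqrt(n) = 256
used = {0}                       # the committee member acting as root
landmarks = grow_landmark_tree(0, n, depth=8, samples_per_node=8,
                               rng=random.Random(42), used=used)
print(len(landmarks))            # at most 2**9 - 2 = 510
```

Passing one shared `used` set to the trees of all committee members models the rule that a node is added to at most one tree.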

Lemma 8. Consider any round $r \ge 2\tau$ and suppose that some node $u \in \mathrm{Core}^r$ executes Algorithm 3 for storing item $\mathcal{I}$ and let $T$ be the set of landmarks created for $\mathcal{I}$. Then the following holds with high probability for a polynomial number of rounds starting at any round $r_1 \ge r + 2\tau$. For $r_2 = r_1 + 4\tau$, there exists a set $M_{\mathcal{I}} \subseteq T \cap \mathrm{Core}^{[r_1, r_2]}$ of landmarks such that every node in $M_{\mathcal{I}}$ is distributed with probability in $[1/17n, 3/2n]$ among the nodes in $\mathrm{Core}^{[r_1, r_2]}$ and

$$\sqrt{n} \le |M_{\mathcal{I}}| \le |T| \le O(n^{0.5+\delta}\log n), \tag{5}$$

for any fixed constant $\delta > 0$.

Proof. We will first argue the right hand side of (5), namely that the total number of landmark nodes is sublinear. To this end, we bound the maximal tree size of any tree created by a committee member. For any constant $\delta > 0$, there is a sufficiently large $n$, such that

21082 ( 2 ~ b?^) (' " i 1 ~ n 5 )) * JTi' 

This allows us to bound the tree depth (cf. ([¥[)) as 

, log 2 n ( 1 \ 



2 
In the worst case, all parent nodes in the tree construction always add 2 child nodes, yielding a tree size of at most

$$2^{(0.5+\delta)\log_2 n + 1} - 1 \in O(n^{0.5+\delta}).$$

Recalling that we have at most $O(\log n)$ committee members, w.h.p., we get the upper bound as stated in (5).

For the lower bound on $|M_{\mathcal{I}}|$, we look at the trees created by the committee members. By Corollary 2 we have that, w.h.p., any node $w \in \mathrm{Com}^{r_1}$ receives $(1 - \varepsilon)h\log n \in \Theta(\log n)$ samples that are distributed with probability $[1/17n, 3/2n]$ among the nodes in $\mathrm{Core}^{[r_1, r_2]}$.

Consider any parent node $v$ and assume that $v \in \mathrm{Core}^{[r_1, r_2]}$. We will bound the probability that a potential child node has already been chosen as a child by some other parent node in a tree. By the upper bound of (5), we know that there are at most $O(n^{1/2+\delta}\log n)$ nodes in the tree at any point. Suppose that node $v$ has received a sample of some node $w'$ and wants to add $w'$ as its child. Since the sampling is performed by doing independent random walks, the event that $w'$ has already been chosen as a child by some other node (possibly in a distinct tree) is independent from $w'$ being sampled by $v$. For sufficiently large $n$, we have

$$\Pr[w' \text{ is already in tree} \wedge w' \text{ sampled by } v] \le \frac{3n^{1/2+\delta}(1 - \varepsilon)h\log n}{2n} \le \frac{3(1 - \varepsilon)h\log n}{n^{1/2-\delta}}.$$

Since the parent node $v$ is in $\mathrm{Core}^{[r_1, r_2]}$, it follows by Lemma 7 that $v$ has $h'\log n = \Theta(\log n)$ samples in $\mathrm{Core}^{[r_1, r_2]}$ w.h.p. For $h' \ge 8$, the probability of $v$ not receiving at least 2 unused child nodes is at most

$$\left(\frac{3(1 - \varepsilon)h\log n}{n^{1/2-\delta}}\right)^{h'\log n - 1} \le \frac{1}{n^{2\log n}} \le \frac{1}{n^3}.$$



Let $X_i$ be the random variable that represents the number of nodes in $M_{\mathcal{I}}$ up to (and including) tree level $i$; recall that, by definition, these nodes are in $\mathrm{Core}^{[r_1, r_2]}$. In addition to a fraction of $(1 - 1/n^3)$ nodes that are lost due to already chosen child nodes, the expectation of $X_i$ is reduced by a factor of at most $(1 - 1/\log^{(k-1)/2} n)$ to compensate for the nodes not in $\mathrm{Core}^{[r_1, r_2]}$, and by the nodes that are churned out during $[r_1, r_2]$, which is at most $(1 - 1/\log^{k-1} n)$:

$$E[X_i] \ge 2E[X_{i-1}]\left(1 - \frac{1}{\log^{(k-1)/2} n}\right)\left(1 - \frac{1}{\log^{k-1} n}\right)\left(1 - \frac{1}{n^3}\right). \tag{6}$$

By Corollary 2 we know that the expected committee size (i.e. $E[X_0]$) is at least $(1 - \varepsilon)h\log n$, which shows that

$$E[X_i] \ge (1 - \varepsilon)h\log n\left(2\left(1 - \frac{1}{\log^{(k-1)/2} n}\right)\left(1 - \frac{1}{\log^{k-1} n}\right)\left(1 - \frac{1}{n^3}\right)\right)^{i}. \tag{7}$$

To lower bound the expected size of $M_{\mathcal{I}}$, we need to plug in the tree height (cf. (4)) for $i$, which reveals that $E[X_\mu] \ge 2\sqrt{n}$. We use a Chernoff bound to show the lower bound on $|M_{\mathcal{I}}|$ as required by (5). From Theorem 4.5 in [39], it follows that

$$\Pr\left[X_\mu \le \left(1 - \sqrt{\frac{2\log n}{\sqrt{n}}}\right)2\sqrt{n}\right] \le \exp\left(-\frac{4\sqrt{n}\log n}{2\sqrt{n}}\right) \le \frac{1}{n^2},$$

which proves that $|M_{\mathcal{I}}| \in \Omega(\sqrt{n})$ with high probability. □
Algorithm 3 Persistently Storing a Data Item

Node $u$ issues an insertion request in round $r$ for data $\mathcal{I}$.
1: Node $u$ initiates Algorithm 1 to create a committee Com and requests the committee nodes to store $\mathcal{I}$. Note that the committee nodes will continue to store $\mathcal{I}$ on $u$'s behalf, even if $u$ has long been churned out.

2: Moreover, $u$ instructs the committee members to execute Algorithm 2 and repeatedly create landmark sets of $\Omega(\sqrt{n})$ nodes that will respond to retrieval requests of $\mathcal{I}$.



4.3 Storage and Retrieval Algorithms 

Now that we have general techniques for maintaining a committee of nodes and creating a randomly distributed set of landmarks for this committee (cf. Sections 4.1 and 4.2), we will use these methods to implement algorithms for storage and retrieval of data items.

Definition 1. We say that a data item $\mathcal{I}$ is available in round $r$, if the probability of any node in $\mathrm{Core}^{[r, r+\tau]}$ to be in the current set of landmarks $M_{\mathcal{I}}$ is at least .


For storing some data item $\mathcal{I}$ by some node $u \in$ Core, we combine the committee maintenance and landmark construction. In more detail, node $u$ first creates a committee of $\Theta(\log n)$ nodes (cf. Algorithm 1), which will be responsible for storing the data item, i.e., every committee member will store a copy of $\mathcal{I}$. The committee immediately starts creating a set of $\Omega(\sqrt{n})$ landmark nodes, which know the ids of the committee members, but do not store $\mathcal{I}$ itself. Choosing these landmark nodes almost uniformly at random (cf. Lemma 8) from the current Core set ensures that the committee members can be found efficiently by the data retrieval mechanism described below.

It follows immediately from Corollary 2 and Lemma 8 that if a data item $\mathcal{I}$ is stored by a node $u \in \mathrm{Core}^{r_1}$ in some round $r_1$, then $\mathcal{I}$ will be available in the network for a polynomial number of rounds starting from $r_1$, with high probability. Owing to the memoryless nature of the persistence of the committee (cf. Theorem 2 and Corollary 2), the same holds with high probability for any later interval of polynomial number of rounds if the data was stored in a good committee at the start of the interval.

Theorem 3 (Data Storage). Consider any round $r \ge 2\tau$. There is a set $A$ of at least $n - o(n)$
nodes, such that any data item X stored by a node in $A$ via Algorithm 3 in round $r$ is available for
a polynomial number of rounds starting from round $r + 2\tau$, with high probability, in a network with
churn rate up to $O(n/\log^{1+\delta} n)$ per round.

Conditioning on the fact that a data item X is available in some round $r_i$ gives us a high
probability bound that X will be available for another polynomial number of rounds, for any $r_i \ge r_1$.

Corollary 3. Suppose that Algorithm 3 has been executed for some data item X since round $r_1$. If X is
available in some round $r_i \ge r_1$, then X will be available for a polynomial number of rounds starting
from $r_i$ with high probability.

Algorithm 4 Retrieval of a Data Item



Node $u$ issues a retrieval request in round $r_1$ for data item X.
1: Node $u$ initiates Algorithm 1 to create a committee Com, which will automatically dissolve itself
after $O(\log n)$ rounds.

2: Node $u$ instructs the committee members to execute Algorithm 2 and repeatedly create a
landmark set of $\Omega(\sqrt{n})$ nodes. Every landmark node $w$ contacts all nodes of received samples
and inquires about X. If X is found, $w$ directly reports this to $u$.



For efficient retrieval of an available data item, we will again use the committee maintenance and
landmark construction techniques. To distinguish the committee and landmark sets that serve the
storage procedure from those created for data retrieval, we will call the former the storage
committee and storage landmarks, and the latter the search committee and search landmarks.

When a node $u \in \mathrm{Core}^{r}$ executes Algorithm 4 to retrieve some available data item X, it first
creates a search committee via Algorithm 1, which in turn is responsible for creating a set of
$\Omega(\sqrt{n})$ search landmarks. These search landmarks have a high probability of being reached by
one of the random walks originating from the storage landmarks that were previously created by
the storage committee members. In more detail, we can show that, with high probability, $\Omega(\sqrt{n})$
search landmark nodes are from the same core set from which the $\Omega(\sqrt{n})$ storage landmarks have
been chosen, and therefore, within $O(\log n)$ rounds, a search landmark is very likely to get to know
the id of one of the storage landmarks.

Theorem 4 (Data Retrieval). Consider any round $r_1 \ge 2\tau$. There is a set $A$ of at least $n - o(n)$
nodes, such that any available data item X can be retrieved by any $u \in A$ via Algorithm 4 in $O(\log n)$
rounds, with high probability, in a network with churn rate up to $O(n/\log^{1+\delta} n)$ per round.

Proof. By assumption, item X is available in $r_1$, which means that every node
$v \in \mathrm{Core}^{[r_1, r_1+\tau]}$ has probability at least $\Omega(1/\sqrt{n})$ to be in $M_X^{r_1}$.
Moreover, by Corollary 3, item X will still be available for a polynomial number of rounds with high
probability. By Lemma 8, we know that, after $O(\log n)$ rounds, the committee created by $u$ has
constructed a set $T$ of $\Omega(\sqrt{n})$ nodes in $\mathrm{Core}^{[r_1, r_1+\tau]}$ that
will report any encounter with a landmark of X to $u$.

For any $v \in \mathrm{Core}^{[r_1, r_1+\tau]}$ we know by Lemmas 5 and 6 that $v$ receives at least one walk
that originated from some $w \in \mathrm{Core}^{[r_1, r_1+\tau]}$ w.h.p., thus the probability of $v$ not
getting to know the id of a landmark node in $M_X^{r_1}$ in round $r_1$ is at most $1 - \Omega(1/\sqrt{n})$.
Applying the same argument to each of the $\Omega(\sqrt{n})$ nodes in $T$ shows that the probability of none
of them finding a landmark node for X is at most

$$\left(1 - \Omega\left(\frac{1}{\sqrt{n}}\right)\right)^{\Omega(\sqrt{n})} \le e^{-\Omega(1)}.$$

Note that, for the next $\tau$ ($\in O(\log n)$) rounds, we have the same probabilities for the nodes in
$T$ to encounter a landmark node of X. (These nodes are in $\mathrm{Core}^{[r_1, r_1+\tau]}$ by assumption,
thus they will not be subjected to churn before round $r_1 + \tau$.) It follows that, within $O(\log n)$
rounds, one node in $T$ will receive a sample from a landmark of X and thus $u$ will be able to
retrieve X with high probability. □
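
The central estimate of the proof is that, if each of about $\sqrt{n}$ search landmarks misses the storage landmarks with probability at most $1 - \Omega(1/\sqrt{n})$, then all of them miss with probability at most $e^{-\Omega(1)}$. The following sketch evaluates this numerically. The constants `a` and `b` stand for the unspecified constants inside the $\Omega(\cdot)$ terms, and `rounds_for_failure` additionally assumes that rounds fail independently; both are illustrative assumptions rather than statements from the paper.

```python
import math


def miss_probability(n, a=1.0, b=1.0):
    """Probability that all ceil(b*sqrt(n)) search landmarks miss in one round,
    assuming each one independently hits with probability a/sqrt(n)."""
    p_hit = a / math.sqrt(n)
    landmarks = math.ceil(b * math.sqrt(n))
    return (1.0 - p_hit) ** landmarks


def rounds_for_failure(n, per_round_miss, exponent=2):
    """Smallest t with per_round_miss**t <= n**(-exponent).

    Assumption: the rounds are treated as independent trials.
    """
    return math.ceil(exponent * math.log(n) / -math.log(per_round_miss))


for n in (100, 10**4, 10**6):
    # Each value stays below e^{-1} (about 0.368) when a = b = 1.
    print(n, round(miss_probability(n), 4))
```

Under these assumptions a constant per-round miss probability is driven below $n^{-2}$ after a number of rounds proportional to $\log n$, which matches the $O(\log n)$ retrieval time stated in Theorem 4.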

4.4 Reducing the number of bits stored using erasure codes 

We can further reduce the total number of bits needed to store large data items by using the standard
technique of erasure codes. We next describe how to incorporate such a technique into our scheme.
Given a data item X to be stored in the network, the algorithm described in the previous
sections simply replicates X at an appropriate number of nodes in the network. The drawback
of replication is that it consumes a large amount of network bandwidth and storage capacity.
An alternative is to apply erasure codes (e.g., the Information Dispersal Algorithm (IDA) [46]) to
encode a data item into a longer message such that a fraction of the encoded data suffices to
reconstruct the original data item. In particular, when applying IDA to storage systems, a data item X
of length $|X|$ is split into $L$ pieces, each of length $|X|/K$, so that any $K$ pieces suffice for
reconstructing X. The total size of all pieces equals $L|X|/K$, and hence IDA is space efficient since
we can choose the blowup ratio $L/K$, which determines the space overhead incurred by the encoding
process, to be close to 1.
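
To make the $K$-out-of-$L$ property concrete, here is a small sketch of an erasure code in the spirit of IDA. It is not the exact construction of [46]: it works over the prime field GF(257), treats every block of $K$ bytes as the coefficients of a polynomial, and stores one evaluation per piece. The function names and the choice of field are assumptions made for illustration, and the code requires Python 3.8 or later for the modular inverse via `pow`.

```python
P = 257  # prime just above the byte range; piece symbols lie in [0, 256]


def encode(data: bytes, K: int, L: int):
    """Encode data into L pieces so that any K of them recover it."""
    assert 1 <= K <= L < P
    padded = list(data) + [0] * (-len(data) % K)
    pieces = {x: [] for x in range(1, L + 1)}
    for i in range(0, len(padded), K):
        coeffs = padded[i:i + K]
        for x in pieces:
            y = 0
            for c in reversed(coeffs):  # Horner evaluation of the block at x
                y = (y * x + c) % P
            pieces[x].append(y)
    return pieces


def solve_mod(A, b):
    """Solve A z = b over GF(P) by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if M[r][col] % P)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % P for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


def decode(subset, K, length):
    """Recover the original bytes from any K pieces given as {x: symbols}."""
    xs = sorted(subset)[:K]
    A = [[pow(x, j, P) for j in range(K)] for x in xs]  # Vandermonde matrix
    out = []
    for i in range(len(subset[xs[0]])):
        out.extend(solve_mod(A, [subset[x][i] for x in xs]))
    return bytes(out[:length])


data = b"dynamic P2P storage"
pieces = encode(data, K=3, L=5)
survivors = {x: pieces[x] for x in (2, 4, 5)}  # pieces 1 and 3 were lost
assert decode(survivors, 3, len(data)) == data
```

Each piece holds one symbol per block of $K$ bytes, so the total stored volume is about $L|X|/K$ symbols, in line with the blowup ratio $L/K$ discussed above.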

Here, we show that our algorithm described in Section 4 can be easily modified to apply erasure
codes for storing data items in the network; however, the main challenge is maintaining at
least $K$ pieces of X under node churn (the number of nodes storing pieces of X might decrease
rapidly over time).

Suppose that a node $u$ wants to carry out an insertion process of a data item X in round $r_1$.
First, $u$ creates a committee $\mathrm{Com}^{r_1}$ of $h \log n$ members for X by executing Algorithm 1,
applies IDA to split X into $h \log n$ pieces, each of size $|X|/((h-2)\log n)$, and then requests each
member of $\mathrm{Com}^{r_1}$ to store one of these pieces. Next, $u$ instructs the committee members to
execute Algorithm 5 and repeatedly create landmark sets of $\Omega(\sqrt{n})$ nodes that will respond to
retrieval requests of X. Now, consider round $r_2 = r_1 + \tau$, in which the members of
$\mathrm{Com}^{r_1} \cap V^{r_2}$ execute Algorithm 1 to construct a new committee for X. We slightly modify
the committee maintenance stage in Algorithm 1 as follows. We first bound the size of
$\mathrm{Com}^{r_1} \cap V^{r_2}$. Note that the probability of any node $s \in \mathrm{Core}^{r_1}$
to be in $\mathrm{Com}^{r_1}$ is at most $\frac{3}{2n}$. Since the adversary is oblivious and the churn
rate is $O(n/\log^{1+\delta} n)$, the probability that a node $v \in \mathrm{Com}^{r_1}$ is churned out in a
later round $r > r_1$ is at most

$$\frac{3}{2n} \cdot \frac{4n}{\log^{1+\delta} n} = \frac{6}{\log^{1+\delta} n}. \qquad (8)$$

By taking a union bound on (8) over the rounds in $[r_1, r_2]$, we get that

$$\Pr\left[(v \in \mathrm{Com}^{r_1}) \wedge (v \notin V^{r_2})\right] \le \frac{6}{\log^{\delta} n}.$$

Let $Y$ be the random variable that counts the nodes in $\mathrm{Com}^{r_1}$ that are subjected to
churn in $[r_1, r_2]$. Since $|\mathrm{Com}^{r_1}| = h \log n$, it follows that

$$E[Y] \le 6h \log^{1-\delta} n.$$

By using a standard Chernoff bound, we get $\Pr[Y \ge 2\log n] \le 2^{-2\log n}$ (note that
$2\log n \ge 6E[Y]$ for sufficiently large $n$), which shows that, with probability at least
$1 - O(n^{-2})$, the size of $\mathrm{Com}^{r_1}$ is reduced by at most $2\log n$ in $[r_1, r_2]$, i.e.,

$$|\mathrm{Com}^{r_1} \cap V^{r_2}| \ge (h-2)\log n.$$
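
The chain of bounds in this argument (a per-round churn probability of $6/\log^{1+\delta} n$ for a committee member, a union bound over the interval, and the resulting expected loss of $6h\log^{1-\delta} n$ members) can be evaluated for concrete parameters. The sketch below assumes that the interval between committee rebuilds spans $\log_2 n$ rounds and that all logarithms are base 2; the function name and these choices are illustrative assumptions.

```python
import math


def committee_loss_bounds(n, h, delta):
    """Evaluate the upper bounds used in the committee-loss argument.

    Assumption: the interval [r_1, r_2] spans log2(n) rounds.
    Returns (per_interval, expected_lost, chernoff_premise_holds).
    """
    log_n = math.log2(n)
    per_round = 6 / log_n ** (1 + delta)        # Eq. (8)
    per_interval = per_round * log_n            # union bound: 6 / log^delta n
    expected_lost = h * log_n * per_interval    # 6 h log^(1 - delta) n
    chernoff_premise_holds = 2 * log_n >= 6 * expected_lost
    return per_interval, expected_lost, chernoff_premise_holds


print(committee_loss_bounds(2**64, h=3, delta=1.0))
print(committee_loss_bounds(2**20, h=3, delta=1.0))
```

With these particular constants the premise $2\log n \ge 6E[\cdot]$ of the Chernoff step is met for $n = 2^{64}$ but not for $n = 2^{20}$, which illustrates that the guarantee is asymptotic in $n$.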

Let $c^{r_2}$ be the node defined in Algorithm 1 which, by definition, knows the ids of all other nodes
in $\mathrm{Com}^{r_1} \cap V^{r_2}$. Therefore, with high probability, X can be reconstructed at $c^{r_2}$
in round $r_2 + 1$. In round $r_2 + 2$, $c^{r_2}$ chooses $h \log n$ random walks that stopped at $c^{r_2}$
in round $r_2$ and invites their source nodes to form the new committee in round $r_2 + 3$. At the same
time (round $r_2 + 2$), $c^{r_2}$ reconstructs the original data item X, re-encodes X by applying IDA, and
then requests, along with the invitation to join the new committee, that each candidate of the new
committee store one of the resulting pieces.
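
As a rough illustration of the savings, the following sketch compares the total number of bits held by a committee of $h \log n$ members under plain replication and under the coded scheme, where each member stores one piece of size $|X|/((h-2)\log n)$. The function name, the use of base-2 logarithms, and the sample parameters are assumptions for illustration only.

```python
import math


def storage_cost(item_bits, n, h):
    """Total bits stored by the committee: (replication, erasure-coded).

    Assumes h > 2, so that (h - 2) * log n pieces suffice to rebuild the item.
    """
    members = math.ceil(h * math.log2(n))
    needed = math.ceil((h - 2) * math.log2(n))
    replication = members * item_bits            # every member keeps a full copy
    piece_bits = math.ceil(item_bits / needed)   # size of one coded piece
    coded = members * piece_bits                 # every member keeps one piece
    return replication, coded


replication, coded = storage_cost(item_bits=8_000_000, n=2**20, h=4)
print(replication // coded)  # how many times more bits replication stores
```

In this setting the coded scheme stores about $h/(h-2)$ times the size of X in total, whereas replication stores $h \log n$ full copies.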

To retrieve a data item X, a node $u$ interested in X creates a committee and then requests the
committee members to execute Algorithm 4 and repeatedly create a landmark set of $\Omega(\sqrt{n})$ nodes.
Every landmark node $w$ contacts all nodes of received samples and inquires about X. If a piece
of X is found, say at node $v$, then $w$ directly reports this to $u$. Note that $v$ is a member of the
committee storing X, and hence knows the ids of all other members of this committee. This enables
$u$ to contact the committee of X and to reconstruct the original item at $u$.

5 Conclusion 

We have presented efficient algorithms for robust storage and retrieval of data items in a highly
dynamic setting where a large number of nodes can be subject to churn in every round and the
topology of the network is under the control of the adversary. An important open problem is finding
lower bounds for the maximum amount of churn that is tolerable by any algorithm with sublinear
message complexity. For random-walk-based approaches, we conjecture that there is a fundamental
limit at $o(n/\log n)$ churn, for the simple reason that if churn can be of order $\Omega(n/\log n)$, the
adversary can subject a constant fraction of the nodes to churn by the time a random walk has
completed its course. In this context, it will be interesting to determine the exact tradeoff between
message complexity and the tolerable amount of churn per round.